Text Analytics and Transcription Technology for Quranic Arabic
نویسندگان
چکیده
Natural Language Processing Working Together with Arabic and Islamic Studies is a 2-year project funded by the UK Engineering and Physical Sciences Research Council (EPSRC) to study prosodicsyntactic mark-up in the Quran (Atwell et al 2013). Tajwīd or correct Quranic recitation is very important in Islam. The original insight informing this project is to view tajwīd mark-up in the Quran as additional text-based data for computational analysis. This mark-up is already incorporated into Quranic Arabic script, and identifies phrase boundaries of different strengths, plus lengthened syllables denoting prosodically and semantically salient words. We have developed a graphemephoneme mapping scheme (Brierley et al 2016), plus state-of-the-art software (Sawalha et al 2014) for generating a stressed and syllabified phonemic transcription or citation form for each word in the entire text of the Quran, using the International Phonetic Alphabet (IPA). This canonical pronunciation tier for Classical Arabic is informed and evaluated by Arabic linguists, tajwīd scholars, and phoneticians, and published in an open-source Boundary-Annotated Quran corpus and machine learning dataset (ibid). We utilise statistical techniques such as keyword extraction to explore semiotic relationships between sound and meaning in the Quran, invoking a Saussurean-type view of the sign as ‘...a bi-unity of expression and content...’ (Dickins 2007). Our investigation entails: (i) text data mining for statistically significant phonemes, syllables, words, and correlates of rhythmic juncture; and (ii) interpretation of results from interdisciplinary perspectives: Corpus Linguistics; tajwīd science; Arabic Linguistics; and Phonetics and Phonology.
منابع مشابه
Syntactic Annotation Guidelines for the Quranic Arabic Dependency Treebank
The Quranic Arabic Dependency Treebank (QADT) is part of the Quranic Arabic Corpus (http://corpus.quran.com), an online linguistic resource organized by the University of Leeds, and developed through online collaborative annotation. The website has become a popular study resource for Arabic and the Quran, and is now used by over 1,500 researchers and students daily. This paper presents the tree...
متن کاملMorphological Annotation of Quranic Arabic
The Quranic Arabic Corpus (http://corpus.quran.com) is an annotated linguistic resource with multiple layers of annotation including morphological segmentation, part-of-speech tagging, and syntactic analysis using dependency grammar. The motivation behind this work is to produce a resource that enables further analysis of the Quran, the 1,400 year old central religious text of Islam. This paper...
متن کاملA Pilot PropBank Annotation for Quranic Arabic
The Quran is a significant religious text written in a unique literary style, close to very poetic language in nature. Accordingly it is significantly richer and more complex than the newswire style used in the previously released Arabic PropBank (Zaghouani et al., 2010; Diab et al., 2008). We present preliminary work on the creation of a unique Arabic proposition repository for Quranic Arabic....
متن کاملArabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
متن کاملArabic Part of speech Tagging using k-Nearest Neighbour and Naive Bayes Classifiers Combination
Part Of Speech (POS) tagging forms the important preprocessing step in many of the natural language processing applications such as text summarization, question answering and information retrieval system. It is the process of classifying every word in a given context to its appropriate part of speech. Different POS tagging techniques in the literature have been developed and experimented. Curre...
متن کامل